Rapid side-channel access to storage devices

ABSTRACT

Disclosed are systems, methods, and apparatuses for providing a high-speed data path to storage devices. In one embodiment, a method is disclosed comprising receiving, by a processor, a data access command, the data access command specifying a location in memory to access data; issuing, by the processor, the data access command to a storage device via a first datapath, the first datapath comprising a non-block datapath; and accessing, by the processor, a non-volatile storage component of the storage device through the first datapath and the memory, wherein the non-volatile storage component is mapped to memory accessible by the processor.

This application includes material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the Patent and Trademark Office files or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND

The disclosed embodiments relate to storage devices and, specifically, to systems, devices, and methods for improving latency of critical data transfers to and from storage devices.

Traditionally, storage devices employed block-based data paths and file access mechanisms. Data written to, and read from, storage devices passes through multiple layers of processing to convert a system call (e.g., a file path-based request) to a block-based command processible by the storage device. In essence, these layers translate application-layer reads/writes to block-based reads/writes understandable by the underlying storage fabric (e.g., NAND Flash).

Block-based access to storage devices provides advantages of scale. Data can be read/written in larger blocks versus bit or byte-level access. Thus, larger data items can be written in a smaller number of blocks. However, the use of block-based file access patterns negatively impacts alternative input/output (I/O) patterns.

In many devices, metadata is read/written in addition to actual data. This metadata may comprise, for example, journaling or log data to be written to the storage device. This metadata is usually small, significantly smaller than the actual data. Often, this metadata is also on the critical path of execution for a file access. That is, the metadata must be processed prior to reading/writing the actual data.

Current devices that utilize block-based processing subject these small metadata accesses to the same block-based processing as all data. Thus, the metadata is processed by multiple layers of the client device's file-handling software. This additional processing can add significant time to the latency of the request. Additionally, the additional processing is often unnecessary and unsuitable for random access reads/writes. For example, block processing results in an approximately 10-microsecond latency while the latency of a PCI bus is approximately 1 microsecond. When incurring this latency for critical path metadata access, the entire file access is significantly impacted as the file access must await the results of the metadata access.

Thus, there exists a need in current systems to improve the latency associated with critical path I/O such as metadata accesses.

BRIEF SUMMARY

The disclosed embodiments solve these and other problems by establishing a side-channel datapath that allows for rapid reads/writes while preserving traditional block-based I/O channels, for example by providing a high-speed data channel to PCI and NVMe-based storage devices.

In one embodiment, a method is disclosed comprising mapping, by a processor, a non-volatile storage component of a storage device to memory accessible by the processor; receiving, by the processor, a data access command, the data access command specifying a location in memory to access data; issuing, by the processor, the data access command to the storage device via a first datapath, the first datapath comprising a non-block datapath; and accessing, by the processor, the non-volatile storage component through the first datapath and the memory.

In another embodiment, a system is disclosed comprising a storage device, the storage device including a non-volatile storage component; a processor; a memory for tangibly storing thereon program logic for execution by the processor, the stored program logic comprising: logic, executed by the processor, for mapping the non-volatile storage component to the host memory; logic, executed by the processor, for receiving a data access command, the data access command specifying a location in host memory to access data; logic, executed by the processor, for issuing the data access command to the storage device via a first datapath, the first datapath comprising a non-block datapath; and logic, executed by the processor, for accessing the non-volatile storage component through the first datapath and the memory.

As described above, existing solutions use a block-based data channel for all data accesses between storage devices and host devices. The use of block-based channels for smaller data accesses (e.g., metadata reads/writes, journaling, caching, etc.) is inefficient and blocks the overall I/O of the storage device. The disclosed embodiments describe a technical solution to this problem by introducing a high-throughput, memory-mapped datapath that allows for rapid access to a non-volatile storage device located in the storage device controller (e.g., existing DRAM or dedicated SCM). The disclosed embodiments result in a significantly faster datapath for critical requests, improving the latency of these requests from approximately ten microseconds to approximately one microsecond.

BRIEF DESCRIPTION OF THE DRAWINGS

The preceding and other objects, features, and advantages of the disclosure will be apparent from the following description of embodiments as illustrated in the accompanying drawings, in which reference characters refer to the same parts throughout the various views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating principles of the disclosure.

FIG. 1 is a block diagram illustrating a computing system according to some embodiments of the disclosure.

FIG. 2 is a block diagram illustrating a computing system with an enhanced metadata datapath according to some embodiments of the disclosure.

FIG. 3 is a block diagram illustrating a computing system with an enhanced metadata datapath according to some embodiments of the disclosure.

FIG. 4 is a flow diagram illustrating a method for accessing data in a storage class memory device using a memory-mapped address according to some embodiments of the disclosure.

FIG. 5 is a flow diagram illustrating a method for processing storage device operations according to some embodiments of the disclosure.

FIG. 6A is a flow diagram of a method for performing journaling by a filesystem according to some embodiments of the disclosure.

FIG. 6B is a flow diagram of a method for implementing a byte addressable write buffer in local memory of a storage device according to some embodiments of the disclosure.

FIG. 6C is a flow diagram of a method for implementing a byte addressable read cache in local memory of a storage device according to some embodiments of the disclosure.

FIG. 6D is a flow diagram of an alternative method for implementing a byte addressable read cache in local memory of a storage device according to some embodiments of the disclosure.

FIG. 7 is a hardware diagram illustrating a device for accessing an object storage device according to some embodiments of the disclosure.

DETAILED DESCRIPTION

The present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, certain example embodiments. Subject matter may, however, be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any example embodiments set forth herein; example embodiments are provided merely to be illustrative. Likewise, a reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware or any combination thereof (other than software per se). The following detailed description is, therefore, not intended to be taken in a limiting sense.

Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” as used herein does not necessarily refer to the same embodiment and the phrase “in another embodiment” as used herein does not necessarily refer to a different embodiment. It is intended, for example, that claimed subject matter include combinations of example embodiments in whole or in part.

In general, terminology may be understood at least in part from usage in context. For example, terms, such as “and”, “or”, or “and/or,” as used herein may include a variety of meanings that may depend at least in part upon the context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” as used herein, depending at least in part upon context, may be used to describe any feature, structure, or characteristic in a singular sense or may be used to describe combinations of features, structures or characteristics in a plural sense. Similarly, terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context. In addition, the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for the existence of additional factors not necessarily expressly described, again, depending at least in part on context.

The present disclosure is described below with reference to block diagrams and operational illustrations of methods and devices. It is understood that each block of the block diagrams or operational illustrations, and combinations of blocks in the block diagrams or operational illustrations, can be implemented by means of analog or digital hardware and computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer to alter its function as detailed herein, a special purpose computer, ASIC, or other programmable data processing apparatus, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified in the block diagrams or operational block or blocks. In some alternate implementations, the functions/acts noted in the blocks can occur out of the order noted in the operational illustrations. For example, two blocks shown in succession can in fact be executed substantially concurrently or the blocks can sometimes be executed in the reverse order, depending upon the functionality/acts involved.

These computer program instructions can be provided to a processor of: a general purpose computer to alter its function to a special purpose; a special purpose computer; ASIC; or other programmable digital data processing apparatus, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified in the block diagrams or operational block or blocks, thereby transforming their functionality in accordance with embodiments herein.

For the purposes of this disclosure a computer-readable medium (or computer-readable storage medium/media) stores computer data, which data can include computer program code (or computer-executable instructions) that is executable by a computer, in machine-readable form. By way of example, and not limitation, a computer-readable medium may comprise computer-readable storage media, for tangible or fixed storage of data, or communication media for transient interpretation of code-containing signals. Computer-readable storage media, as used herein, refers to physical or tangible storage (as opposed to signals) and includes without limitation volatile and non-volatile, removable and non-removable media implemented in any method or technology for the tangible storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other physical or material medium which can be used to tangibly store the desired information or data or instructions and which can be accessed by a computer or processor.

FIG. 1 is a block diagram illustrating a computing system according to some embodiments of the disclosure.

In the illustrated embodiment, an application (101) reads, writes, and modifies data stored on NAND devices (123). Access to NAND devices (123) is mediated by a controller (111) and the requests between NAND devices (123) and application (101) are processed by various layers (103, 105, 107, 109) of an operating system installed on a device (100).

Application (101) comprises any application running on device (100). For example, application (101) may comprise a user-level application (e.g., a web browser, database, word processor, etc.). Alternatively, or in conjunction with the foregoing, application (101) may comprise a system-level application (e.g., a utility application, a shell, etc.). In general, application (101) comprises any software application that reads, writes, or modifies data stored in NAND devices (123).

An application (101) may access data in the NAND devices (123) by issuing system calls provided by an operating system of the device (100). For example, in a LINUX environment, the application (101) may issue CLOSE, CREAT, OPEN, MKDIR, LINK, CHOWN, and various other system calls to operate on filesystem data. The specific system calls used are not intended to be limiting and any system calls on any operating system may be called by application (101).

A filesystem (103) receives the system calls issued by application (101). The filesystem (103) may comprise a standard filesystem (e.g., ext3) or other filesystem (e.g., a FUSE-based filesystem) capable of handling system calls for a given file. As known in the art, the filesystem may be dynamically loaded based on the underlying file being accessed or the partition/drive being accessed. The system call includes a payload specific to the type of system call (e.g., a filename and data to be written). The filesystem (103) performs various operations to handle the system call.

After processing the system call, the filesystem (103) forwards the data to a block layer (105). The block layer (105) is an abstraction over various block-based devices supported by the system. While the block layer (105) may perform various functions (e.g., scheduling), one primary function of the block layer (105) is to convert the bitstream from the system call into block-level commands appropriate for the underlying storage device.

The block layer (105) forwards the block data to an NVMe driver (107). Other drivers may be used in place of NVMe, and the NVMe driver (107) is used primarily as an example of a bus protocol. The NVMe driver (107) or similar driver converts the block-level commands into commands conforming to the NVMe protocol (or suitable alternative protocol). For example, the NVMe driver (107) may place the various blocks into one or more deep queues.

Finally, the NVMe-formatted data is sent to the PCIe (Peripheral Component Interconnect Express) driver (109) which performs a final conversion of the data in accordance with the PCIe protocol. In other embodiments, PCI (Peripheral Component Interconnect), PCI-X, or other bus protocols may be used. After PCIe formatting, the device (100) transmits the data over a bus (119) to the storage controller (111). In one embodiment, the bus (119) comprises an NVMe bus, while other busses may be selected based on the protocols used by the system.

Controller (111) receives the NVMe blocks via host interface (113). In one embodiment, host interface (113) is configured to process PCIe packets and extract the underlying NVMe blocks from the PCIe packets. Firmware (115) is used to process the unwrapped packets in accordance with the NVMe protocol (or similar protocol). In one embodiment, the firmware (115) extracts a logical block address (LBA) from the NVMe block and converts it to a physical block address (PBA). This conversion is facilitated by DRAM (121) stored locally with the controller (111). The DRAM (121) stores a mapping of LBAs to PBAs and thus is used to convert the received LBA to a PBA of one of the NAND devices (123). The firmware then converts the LBA-based command to a PBA-based command and issues the command to the NAND devices (123) via the NAND interface (117) and channel (125). The NAND interface (117) processes the PBA-based command and generates the NAND signaling required by NAND devices (123) to access the underlying PBA.
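
The LBA-to-PBA conversion described above reduces to a table lookup in the controller's DRAM. The following C sketch is illustrative only and assumes a hypothetical flat table indexed by LBA; the structure and field names are not taken from any particular firmware.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical physical address: NAND package, block, and page. */
struct pba {
    uint16_t package;
    uint16_t block;
    uint16_t page;
};

/* Hypothetical flat LBA-to-PBA table held in controller DRAM. */
struct l2p_table {
    struct pba *entries;   /* one entry per logical block address */
    size_t      num_lbas;
};

/* Translate a logical block address to its physical location.
 * Returns 0 on success, -1 if the LBA is out of range. */
static int l2p_lookup(const struct l2p_table *t, uint64_t lba, struct pba *out)
{
    if (lba >= t->num_lbas)
        return -1;
    *out = t->entries[lba];
    return 0;
}
```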

NAND devices (123) comprise one or more NAND packages, each package comprising a plurality of blocks, each block comprising a plurality of pages. In some embodiments, other non-volatile memory (NVM) may be used in place of NAND memory.

When retrieving data, the aforementioned process is executed in reverse and specific details are not included again for the sake of clarity. NAND devices (123) return data at PBAs to NAND interface (117). The NAND interface (117) processes and forwards the data to firmware (115). In response, the firmware (115) converts the PBA of the data to an LBA using DRAM (121) and transmits the response to host interface (113). In one embodiment, host interface (113) reformats the data according to the protocols used by the system (e.g., PCIe and NVMe) and transmits the data over the bus (119) to the device (100). The device (100) processes the PCIe packets using PCIe driver (109) and then processes the resulting packets using the NVMe driver (107). The resulting data is returned to the block layer (105) which performs block operations on the returned data and transmits the data to the filesystem (103) which returns the appropriate data in response to a system call issued by application (101).

As can be seen, all of the aforementioned operations are block-based. A block is a fixed size of data utilized by the system. When operating in a block-based mode, the size and type of the underlying data are not considered when determining the size of the block, as a standard block size is used. Thus, even if the data to be read or to be written is small, an entire block is allocated and processed by the system through each layer of the system. As one example, metadata reads and writes generally comprise small, random access requests to the underlying storage device. The overhead involved in block-processing these types of requests inherently slows down the response time of the system. Additionally, metadata access is often on the “critical path” of execution. That is, access to metadata frequently results in blocking the operations on underlying data associated with the metadata. Moreover, the amount of processing employed by the various layers of a device (100) adds significant processing time to each storage operation (e.g., in the 10s of microseconds). This processing time is incurred due to the overhead of block operations (layers 103 and 105) as well as NVMe (layer 107) processing. In contrast, the standalone PCIe processing overhead is around 1 microsecond. Thus, manipulating metadata using a block-based interface results in a significant slowdown of the entire system.

Various techniques have been proposed to remedy this deficiency. For example, recent NVMe revisions have added a controller memory buffer (CMB) feature that exposes internal buffers of storage devices to the device (100) address space. The device (100) can then access the CMB as a memory-mapped location. However, the CMB is primarily designed to accelerate NVMe command processing and has several limitations. First, the size of the CMB is significantly limited as most of the DRAM (121) is used to map LBAs to PBAs. Second, the CMB is a volatile memory device (e.g., DRAM or SRAM) and thus loses all data during power loss or cycling. These two features, while useful for command processing, make the CMB unsuitable for metadata access or storage.

FIG. 2 is a block diagram illustrating a computing system with an enhanced metadata datapath according to some embodiments of the disclosure.

In the illustrated embodiment, various components (e.g., 100, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, and 123) may perform similar operations as those described in FIG. 1 and the description of those functions is incorporated by reference. Specifically, the system illustrated in FIG. 2 may perform block-based transfer of data to and from device (100) and NAND devices (123) (e.g., via NVMe bus (119)). However, the illustrated embodiment additionally provides a streamlined metadata datapath and mode of operating the same.

In the illustrated embodiment, the controller (111) includes an additional storage class memory (SCM) device (201). The SCM (201) may comprise a phase change memory (PCM). PCM is a non-volatile memory type that uses different physical phases to represent data. Alternatively, or in conjunction with the foregoing, the SCM (201) may comprise a resistive random-access memory (ReRAM). ReRAM is a non-volatile memory type that uses different resistances to represent data. Alternatively, or in conjunction with the foregoing, the SCM (201) may comprise a spin-transfer torque RAM (STT-RAM). STT-RAM is a type of non-volatile memory that uses different magnetic orientations to represent data. Other types of memory may be used in the SCM (201). In one embodiment, the type of memory need only support byte-addressing, low latency, large capacities, and non-volatility.

In the illustrated embodiment, SCM (201) is embedded within the controller (111). In one embodiment, the SCM (201) may be mounted to a board of the controller and communicatively coupled to the firmware (115). In the illustrated embodiment, the SCM (201) is included on the controller in addition to DRAM (121).

The SCM (201) is exposed to the client system via the persistent CMB (203) of the controller (111). In one embodiment, the number of CMBs used may vary based on the implementation. In general, a CMB is a PCIe base address register (BAR), or region within a BAR, that is used to store data associated with NVMe block commands. In one embodiment, the CMB is directly mapped to a memory location of the device (100). Thus, the client (100) can read and write to the CMB via direct memory access.

In traditional devices, the CMB is mapped to a region of DRAM and is used primarily for command processing. However, in the illustrated embodiment, the CMB is mapped to SCM (201) and not to DRAM (121). Thus, the controller exposes the SCM (201) directly to the device (100) via an assigned memory location on the client device. Since the SCM (201) is persistent, by mapping the SCM (201) to the memory of the device (100) via the CMB (203), the device (100) can access the SCM (201) as external persistent memory. Additionally, since the SCM (201) is independent from DRAM (121), the SCM (201) can be sized according to the needs of the controller and can store as much, or as little, data as determined.

As illustrated, data to and from the SCM (201) uses a direct memory access (DMA) datapath (205, 207). In the illustrated embodiment, data is written by an application (101) to a memory-mapped file (e.g., using mmap on a LINUX system). The filesystem (103) maps the file to the SCM (201) via the PCIe driver (205). Thus, when writing to the memory-mapped file, metadata can be transferred directly to the SCM (201) without accessing the block layer (105) and NVMe driver (107), significantly reducing the latency. In one embodiment, the SCM (201) is exposed to the application (101) via a predefined memory-mapped file that may be accessed by the application (101). Notably, as illustrated, the application (101) may still issue block-based commands for non-metadata operations.
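
By way of illustration only, the host-side access pattern described above might be sketched in C as follows: the application opens a predefined memory-mapped file that the filesystem has associated with the CMB-backed SCM region, maps it with mmap, and transfers metadata with ordinary memory stores. The file path argument and journal-entry layout are assumptions for the example, not part of the disclosure.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical journal entry layout used only for illustration. */
struct journal_entry {
    uint64_t txn_id;
    uint64_t lba;
    uint32_t length;
    uint32_t flags;
};

/* Write one journal entry into the memory-mapped file at 'offset' bytes. */
int write_metadata_fast(const char *mapped_file, const struct journal_entry *e,
                        size_t offset)
{
    int fd = open(mapped_file, O_RDWR);          /* e.g. a predefined mmap file */
    if (fd < 0)
        return -1;

    size_t map_len = 4096;                       /* map a single page for brevity */
    void *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED || offset + sizeof(*e) > map_len) {
        if (base != MAP_FAILED)
            munmap(base, map_len);
        close(fd);
        return -1;
    }

    /* Plain memory stores reach the SCM through the memory-mapped datapath,
     * bypassing the block layer and NVMe driver. */
    memcpy((char *)base + offset, e, sizeof(*e));
    msync(base, map_len, MS_SYNC);               /* make the store durable */

    munmap(base, map_len);
    close(fd);
    return 0;
}
```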

FIG. 3 is a block diagram illustrating a computing system with an enhanced metadata datapath according to some embodiments of the disclosure.

In the illustrated embodiment, various components (e.g., 100, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, and 123) may perform similar operations as those described in FIG. 1 and the description of those functions is incorporated by reference.

In the illustrated embodiment, the controller (111) includes one or more PCI BARs. PCI BARs are used to address a PCI device implementing the controller. A given BAR must be enabled by being mapped into the I/O port address space or memory-mapped address space of the device (100). Host software on device (100) programs the BARs of the controller (111) to inform the PCI device of its address mapping by writing configuration commands to the controller (111). Each PCI device may have six BARs, each comprising a 32-bit address. Alternatively, or in conjunction with the foregoing, the BARs may comprise three 64-bit addresses. If the device has memory-mapped space (e.g., internal DRAM), it can be mapped to host address space so that host software can access the device with memory load/store operations.

In the illustrated embodiment, a portion of DRAM (121) is assigned to a PCI BAR (301). Thus, when the system maps the PCI BAR window into its address space, the host can use BAR (301) as a byte-addressable read/write window. Using this BAR (301), application (101) can access data through two interfaces. In a first interface, the application (101) accesses the NAND devices (123) through the standard block-based NVMe interface (e.g., datapath 119). In a second interface, the application (101) accesses the DRAM (121) directly using the “memory mode” through byte-addressable PCI BAR (301) (e.g., via datapaths 303, 305). With this hybrid design, the application (101) can choose the optimal interface according to its I/O pattern. For example, high-priority or latency-sensitive requests may be served through BAR window (301), while low-priority or throughput-oriented requests may be served through the block-based NVMe bus (119).

As described above, DRAM (121) is utilized by the controller (111) to store an LBA-to-PBA mapping. Thus, a portion of the DRAM (121) may be isolated to perform LBA/PBA matching. The remaining portion of DRAM (121) may be utilized as a cache or buffer. Sizing of the cache or buffer may be enforced by firmware (115) via the PCI BAR (301). That is, the PCI BAR (301) may point to the first free memory location in DRAM (121). In some embodiments, multiple PCI BARs may be mapped to DRAM (121). In this embodiment, a first PCI BAR may be mapped to the first portion of free space in DRAM (121) while a second PCI BAR may be mapped to a DRAM (121) address midway within the free space.

In one embodiment, the BAR-mapped portion of DRAM (121) may be used as a byte addressable read cache. This cache can provide near sub-microsecond latency and significantly improve the Quality-of-Service (QoS) of read requests via the datapath (303, 305). For example, reading data from a NAND channel that is actively serving a program or erase command can cause a long delay. Some mechanisms like “program suspend” can mitigate this delay, but these mechanisms introduce extra complexity to firmware (115) and are dependent on NAND support. Instead, by using the BAR (301), firmware can pre-fetch hot data from a channel before issuing a program/erase command, so that conflicting read requests can be served through the BAR (301). In another embodiment, the BAR (301) allows for reading of data from an “open” block (i.e., a block that has not been fully programmed yet). Due to NAND characteristics, such read requests will suffer from longer latency caused by a higher bit error rate. By caching data from open blocks in BAR (301), read requests will not be affected by this NAND limitation.

In a second embodiment, the BAR-mapped portion of DRAM (121) may be used as a byte addressable write buffer. In this embodiment, the BAR (301) is used to serve latency-sensitive requests such as write-ahead logging and journaling. In these cases, the device (100) first configures the write buffer with a certain threshold. It then appends data to the write buffer (along with LBAs, if the data is not contiguous). Once data in the write buffer reaches the threshold, firmware (115) automatically “flushes” the data back to NAND devices (123). Depending on the sizes of DRAM (121) and power-loss capacitors, multiple write buffers may be configured in the BAR (301), allowing multiple write streams.

FIG. 4 is a flow diagram illustrating a method for accessing data in a storage class memory device using a memory-mapped address according to some embodiments of the disclosure.

In step 402, the method maps a non-volatile storage device to a host memory location.

In one embodiment, the non-volatile storage device comprises a portion of DRAM located in a controller of a storage device. In another embodiment, the non-volatile storage device comprises an SCM module installed on a controller of a storage device. In the illustrated embodiment, mapping a non-volatile storage device comprises mapping the device using a PCI BAR window of the storage device. In one embodiment, the PCI BAR window comprises a CMB window of the storage device.
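
On a LINUX host, one conventional way to obtain such a mapping is to mmap the BAR exposed through the sysfs "resource" file for the PCI device. The sketch below is a hedged, host-side illustration under that assumption; the sysfs path and BAR index in the usage comment are hypothetical.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a PCI BAR exposed by the Linux sysfs "resourceN" file into this
 * process's address space.  Returns the mapped base or NULL on error. */
void *map_pci_bar(const char *resource_path, size_t *out_len)
{
    int fd = open(resource_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);                                   /* mapping persists after close */
    if (base == MAP_FAILED)
        return NULL;

    *out_len = (size_t)st.st_size;
    return base;
}

/* Usage (hypothetical device address and BAR index):
 *   size_t len;
 *   void *bar = map_pci_bar("/sys/bus/pci/devices/0000:03:00.0/resource2", &len);
 */
```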

In step 404, the method receives a data access command.

In one embodiment, a data access command refers to any command that reads, writes, or otherwise manipulates data on a storage device. Examples of data access commands include read, write, delete, change attributes, and other types of commands. In one embodiment, the data access command comprises a system call issued by a host application. In each embodiment, the data access command may include a filepath or filename. In some embodiments, write commands or update commands include data to be written.

In step 406, the method determines if the data access command is critical or non-critical.

The determination of whether a command is critical may be performed in various ways. In one embodiment, an operating system provides separate system calls for critical data access. In one embodiment, these separate system calls may comprise standard POSIX system calls prefixed with a defined prefix (e.g., “C_OPEN” for “critical” open). In another embodiment, POSIX system calls may be modified to include a critical flag to indicate that the call is critical.

In other embodiments, no modification to the operating system is required. Instead, one or more filepaths may be memory-mapped to critical locations of the storage device. For example, as described above, the CMB mechanism of a solid-state device (SSD) enables the mapping of a memory region to a memory-mapped storage area of the host device's memory. When initializing the storage device, the driver (e.g., PCIe driver) of the device may map a logical storage device to the CMB location. In this embodiment, in step 406 the method inspects the filepath of the data access command and automatically classifies the command as critical if the command is accessing the critical storage location mapped during initialization.

In other embodiments, the payload of the data access command may additionally be used to classify the command as critical or non-critical. For example, the method may analyze the payload of a write command and determine if the payload is lower than a pre-determined threshold. In this example, the threshold may be suitably small to ensure that only small data access payloads trigger the critical determination. Thus, by using the threshold, the method can route all smaller I/O accesses to a critical path.
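
A minimal sketch of how host software might combine the path-based and payload-size checks described above; the mount prefix and size threshold below are illustrative assumptions, not values required by the disclosure.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative values only: a path prefix mapped to the CMB/BAR region at
 * initialization, and a payload size at or below which requests are treated
 * as critical. */
#define CRITICAL_PREFIX   "/mnt/fast-meta/"
#define CRITICAL_MAX_SIZE 4096u

/* Classify a data access command as critical (true) or non-critical (false). */
static bool is_critical_access(const char *filepath, size_t payload_len)
{
    /* Path check: the file lives in the memory-mapped storage location. */
    if (strncmp(filepath, CRITICAL_PREFIX, strlen(CRITICAL_PREFIX)) == 0)
        return true;

    /* Size check: small payloads are routed to the non-block datapath. */
    if (payload_len > 0 && payload_len <= CRITICAL_MAX_SIZE)
        return true;

    return false;
}
```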

In step 408, the method issues a block-based command if the data access command is non-critical.

As described above, the method may proceed in converting the system call to a block-based access command. In some embodiments, this process entails passing the system call through a block layer, NVMe driver, and PCIe driver, before the command is transmitted to a controller of a storage device.

In step 410, the method identifies a mapped address location upon classifying the data access command as a critical access.

In one embodiment, the method identifies the host device memory location mapped to the CMB or BAR location of the storage device. On startup, the host device reads the BARs of the storage device to identify how many bytes should be mapped between the host device and the storage device and what type of mapping is required (e.g., memory access). The host device allocates this memory and writes to the BARs, the data written comprising the address in host memory to which the PCI device will respond.

In one embodiment, the method computes the mapped address based on the filepath included in the data access command as well as any offsets included in the data access command. Thus, if the command requests to write 8 bytes to the mapped file at an offset of 6 bytes, the method will compute the memory address 6 bytes from the start of the mapped initial address and use this address as the mapped address location for the memory write.
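
As a concrete illustration of the offset arithmetic in the example above (an 8-byte write at a 6-byte offset), assuming the base address and length recorded when the mapping was established:

```c
#include <stdint.h>
#include <string.h>

/* Compute the host memory address backing a byte offset in a mapped file and
 * copy the payload there.  'map_base' is the address at which the mapping was
 * established; bounds are checked against the mapped length. */
static int mapped_write(void *map_base, size_t map_len,
                        size_t offset, const void *data, size_t len)
{
    if (offset + len > map_len)
        return -1;                                 /* would run past the mapping */

    uint8_t *dst = (uint8_t *)map_base + offset;   /* e.g. base + 6 */
    memcpy(dst, data, len);                        /* e.g. 8 bytes */
    return 0;
}
```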

In step 412, the method accesses local memory at the mapped location.

In one embodiment, accessing the local memory comprises issuing memory read, write, and other commands. Since the local memory addresses are mapped to the internal location within the storage device, accesses to these local memory locations are forwarded to the corresponding CMB or BAR address and, ultimately, to the underlying storage device (SCM or DRAM).

FIG. 5 is a flow diagram illustrating a method for processing storage device operations according to some embodiments of the disclosure.

In step 502, the method receives a memory-mapped I/O packet.

In one embodiment, this packet comprises a PCIe packet. In the previous figure, a packet may be generated by the PCIe driver. For example, accessing a memory location in step 410 may comprise writing a value to a specific memory location. Both the physical (or virtual) system memory location and the data are encapsulated in the packet, along with additional metadata supplemented by the PCIe driver. In one embodiment, the method may additionally tag the packet as critical or non-critical (e.g., using the Tag field of the PCI TLP packet format).

In step 504, the method extracts the host address (i.e., the local memory address) from the data packet and, in step 506, translates the host address into a local storage device address.

As described above, the modified storage device supports both block-based I/O access as well as memory I/O. In one embodiment, the translation processing of step 506 comprises translating the host address to a memory location in an SCM storage location. In another embodiment, the translation comprises translating the host address to a memory location in the storage device's DRAM. The determination of which translation to apply may be configured based on the BAR configuration of the storage device.
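
Conceptually, the translation of step 506 subtracts the host base address programmed into the BAR and adds the base of whichever backing store (SCM or DRAM) the BAR is configured to expose. A hedged sketch with hypothetical field names follows.

```c
#include <stdint.h>

/* Hypothetical BAR configuration kept by the controller firmware. */
struct bar_window {
    uint64_t host_base;    /* host address programmed into the BAR */
    uint64_t length;       /* size of the exposed window */
    uint64_t device_base;  /* base of the backing SCM or DRAM region */
};

/* Translate a host address carried in a memory-mapped I/O packet into a
 * local storage-device address.  Returns 0 on success, -1 if the address
 * falls outside the window. */
static int translate_host_address(const struct bar_window *w,
                                  uint64_t host_addr, uint64_t *dev_addr)
{
    if (host_addr < w->host_base || host_addr >= w->host_base + w->length)
        return -1;
    *dev_addr = w->device_base + (host_addr - w->host_base);
    return 0;
}
```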

In step 506, the method executes the operation on the device's memory device.

In one embodiment, the method performs reads, writes, and other operations on the SCM embedded within the device. Alternatively, the operations may comprise reads, writes, and other operations to be performed on local DRAM.

FIG. 6A is a flow diagram of a method for performing journaling by a filesystem according to some embodiments of the disclosure.

In step 602, the method receives a file access command. In one embodiment, this command may comprise any system call issued by an application. In general, the system call represents data to be read, written, or otherwise accessed by the application. In the illustrated embodiment, the data is stored on a storage device such as an SSD.

In step 604, the method writes a journal entry to a location in host memory.

In one embodiment, step 604 may be performed using memory store and load commands. As described above, a storage location on the underlying SSD is mapped (via PCI BARs) to the host memory; thus, in step 604, the method reads/writes to these memory locations and transfers the journal entries to the underlying storage device (e.g., an SCM or dedicated DRAM portion of the SSD controller). The specific form of the journal entry is not intended to be limited. Indeed, the specific form of journal entries may be dependent on the underlying filesystem used by the storage device.

In step 606, the method issues the file command using block I/O.

The issuance of a block I/O command may be performed only after receiving a message that the journaling in step 604 succeeded. Since the journal entry is written using memory load/store commands, this message is returned significantly more quickly than in existing systems that use block I/O for all journal entry writes. As described above, the block I/O commands are processed by various layers of the host device (e.g., filesystem drivers, NVMe drivers, PCIe drivers, etc.) and subsequently executed on the underlying storage device.
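
A hedged host-side sketch of the ordering described in FIG. 6A: the journal entry is written with memory stores to the mapped region and made durable before the block I/O for the file data is issued. The journal record layout and the use of msync/pwrite are assumptions for illustration, not the only possible implementation.

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical journal record. */
struct journal_rec {
    uint64_t seq;
    uint64_t file_lba;
    uint32_t nbytes;
};

/* Write the journal entry through the memory-mapped datapath, then issue the
 * actual file write through the ordinary block-based path (pwrite here). */
int journaled_write(void *journal_map, size_t journal_off, size_t journal_len,
                    const struct journal_rec *rec,
                    int data_fd, const void *data, size_t len, off_t file_off)
{
    if (journal_off + sizeof(*rec) > journal_len)
        return -1;

    /* Step 604: journal entry via memory stores to the mapped SCM/DRAM. */
    memcpy((char *)journal_map + journal_off, rec, sizeof(*rec));
    if (msync(journal_map, journal_len, MS_SYNC) != 0)
        return -1;

    /* Step 606: only after the journal write completes, issue block I/O. */
    if (pwrite(data_fd, data, len, file_off) != (ssize_t)len)
        return -1;
    return 0;
}
```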

FIG. 6B is a flow diagram of a method for implementing a byte addressable write buffer in local memory of a storage device according to some embodiments of the disclosure.

In step 608, host software sizes the write window.

In one embodiment, host software sizes the write window based on the size of DRAM backing the write window as described in connection with FIG. 3. Alternatively, or in conjunction with the foregoing, the method may additionally size the write window based on the power loss capacitance supplied to the DRAM and/or controller of the SSD. In one embodiment, the power loss capacitance represents the amount of charge stored by the controller/DRAM in the event of power loss. In one embodiment, this value is stored in the PCI configuration space of the storage device. The power loss capacitance is used to calculate the maximum size of the write window, the maximum size comprising the largest amount of data that can be transferred from DRAM to persistent storage while operating on the charge stored by the power loss capacitor(s). In some embodiments, a single write window may be sized. Alternatively, the method may size multiple write windows. In this embodiment, each write window may be assigned to a memory location on the host device (via PCI BARs). Thus, the method may support multiple write channels.
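
The sizing rule can be illustrated with simple arithmetic: the window is bounded both by the DRAM reserved for it and by the amount of data the controller can flush on the charge held by its power-loss capacitors. The flush-bandwidth and hold-up-time inputs below are hypothetical.

```c
#include <stdint.h>

/* Compute a write-window size bounded by (a) the DRAM reserved for it and
 * (b) the data the controller can flush to NAND on the charge held by its
 * power-loss capacitors.  All inputs are illustrative assumptions. */
static uint64_t size_write_window(uint64_t dram_reserved_bytes,
                                  uint64_t holdup_time_us,
                                  uint64_t flush_bandwidth_bytes_per_us)
{
    uint64_t flushable = holdup_time_us * flush_bandwidth_bytes_per_us;
    return (flushable < dram_reserved_bytes) ? flushable : dram_reserved_bytes;
}

/* Example: 4 MiB of spare DRAM, 10 ms of hold-up, ~1 GiB/s flush rate
 * -> min(4 MiB, ~10.7 MB) = 4 MiB. */
```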

In step 610, the method identifies critical data.

In one embodiment, the identification of critical data is performed by host application software. Details of identifying critical data are described in FIGS. 4 and 5 and the description of these Figures is included by reference herein. As one example, critical data may comprise high-value data that the application must persist as quickly as possible. As another example, critical data may comprise journaling data.

In step 612, the method writes the critical data to a memory location.

In one embodiment, writing the critical data to a memory location comprises the host application issuing a “memory store” system call and including the critical data as part of the memory store. As described above, the host memory location may be mapped to DRAM of the storage device via a PCI BAR mechanism. Thus, the memory store results in the host device transferring the critical data to the SSD via a PCIe driver (or similar driver), thus bypassing block-level processing.

In step 614, the method appends the critical data to the write window.

In one embodiment, the write window refers to a contiguous section of the DRAM on the controller used by the SSD device. In the illustrated embodiment, the write window acts as a write-append log: all data is written to the last available DRAM memory location. Thus, in this embodiment, the method does not need to translate memory locations as performed in FIG. 5.

In step 616, the method flushes the write window if the window's capacity is reached.

As described above, the write window may have a fixed capacity based on, for example, the size of the DRAM and/or the power loss capacitance of the device. In one embodiment, after each append to the write window, firmware of the SSD device inspects the write window allocated in DRAM to determine if the write window is full. Alternatively, the firmware may compare the current append address to the size of the write window to determine if the write window is full.

If the write window is full, the firmware initiates a flush procedure to empty the write window. In one embodiment, the flush procedure transfers the contents of DRAM to a persistent storage location (e.g., on a NAND Flash array). In some embodiments, the persistent storage includes a separate metadata storage area. In this embodiment, the firmware appends the entire write window to the separate metadata storage location.
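
A hedged firmware-side sketch of the append-and-flush behavior of steps 614-616; the structure, threshold field, and flush callback are assumptions used only to make the control flow concrete.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical append-only write buffer backed by BAR-mapped DRAM. */
struct write_window {
    uint8_t *base;        /* start of the BAR-mapped DRAM region */
    size_t   capacity;    /* total bytes available in the window */
    size_t   threshold;   /* fill level that triggers a flush */
    size_t   used;        /* bytes appended so far */
    /* Flush callback: persists 'len' bytes from 'data' to NAND. */
    int    (*flush)(const uint8_t *data, size_t len);
};

/* Append critical data at the next free location; flush when the threshold
 * is reached. */
static int window_append(struct write_window *w, const void *data, size_t len)
{
    if (w->used + len > w->capacity)
        return -1;                    /* window cannot hold this append */

    /* Step 614: append at the last available location in the window. */
    memcpy(w->base + w->used, data, len);
    w->used += len;

    /* Step 616: once the threshold is reached, flush data back to NAND. */
    if (w->used >= w->threshold) {
        if (w->flush(w->base, w->used) != 0)
            return -1;
        w->used = 0;
    }
    return 0;
}
```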

FIG. 6C is a flow diagram of a method for implementing a byte addressable read cache in local memory of a storage device according to some embodiments of the disclosure.

In step 618, the method receives a program or erase (P/E) command.

A P/E command instructs an SSD to write and/or erase data from a page or block, respectively, of the underlying NAND Flash array. In one embodiment, the P/E command corresponds to an update to a page of a Flash array.

In step 620, the method pre-fetches the block containing the page affected by the P/E command. In one embodiment, the method issues a read command to retrieve all pages in a given Flash block.

In step 624, the method caches the pages in the block. In one embodiment, the method caches each page by writing each page to a portion of DRAM dedicated to read cache operations. In one embodiment, the size of the read cache is limited only by the available DRAM space.

As illustrated, step 624 and steps 626-630 may occur in parallel.

After caching the block affected by the P/E command, the method proceeds to execute the P/E command as usual.

While executing the command, however, a host device may issue a read to a page stored in the block undergoing programming/erasing. In existing systems, this read would be required to be buffered until the program/erase cycle was complete. In some systems, a “program suspend” function may be used to suspend the programming operation. However, this feature introduces significant complexity into the firmware of the SSD. In contrast, the method receives the conflicting read in step 626 and retrieves the corresponding page from cache in step 628. Finally, the method returns the cached data (step 630) to the requesting application.

As discussed above, the cache in DRAM includes all pages from the block undergoing programming and/or erasing in step 624. Thus, when the method receives the conflicting read command, the method retrieves the requested page from the read cache in DRAM. In one embodiment, a separate memory address to LBA mapping is used to map the received LBA from the P/E command to a DRAM memory address.
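
A hedged firmware-side sketch of the FIG. 6C pattern: before a program/erase is issued to a block, its pages are copied into a DRAM read cache, and a read that conflicts with the in-progress P/E is served from that cache. The page/block geometry and NAND read callback are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define PAGES_PER_BLOCK 256
#define PAGE_SIZE       4096

/* Hypothetical DRAM read cache holding one prefetched block. */
struct read_cache {
    uint32_t block;                                /* cached block number */
    uint8_t  pages[PAGES_PER_BLOCK][PAGE_SIZE];    /* cached page contents */
    int      valid;
};

/* NAND read callback supplied by the surrounding firmware (assumed). */
typedef int (*nand_read_fn)(uint32_t block, uint32_t page, uint8_t *out);

/* Steps 620/624: prefetch every page of the block about to be programmed. */
static int cache_prefetch_block(struct read_cache *c, uint32_t block,
                                nand_read_fn nand_read)
{
    for (uint32_t p = 0; p < PAGES_PER_BLOCK; p++) {
        if (nand_read(block, p, c->pages[p]) != 0)
            return -1;
    }
    c->block = block;
    c->valid = 1;
    return 0;
}

/* Steps 626-630: serve a read that conflicts with the in-progress P/E. */
static int cache_serve_read(const struct read_cache *c, uint32_t block,
                            uint32_t page, uint8_t *out)
{
    if (!c->valid || c->block != block || page >= PAGES_PER_BLOCK)
        return -1;                       /* not cached: fall back to NAND path */
    memcpy(out, c->pages[page], PAGE_SIZE);
    return 0;
}
```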

FIG. 6D is a flow diagram of an alternative method for implementing a byte addressable read cache in local memory of a storage device according to some embodiments of the disclosure.

In step 632, the method receives a program command. In one embodiment, the program command is similar to that described in connection with FIG. 6C, the disclosure of which is incorporated herein by reference in its entirety.

In step 634, the method identifies a block associated with the page. In one embodiment, the block identifier is identified using the LBA-to-PBA mapping included in DRAM, as discussed above.

Step 636 and step 626 may execute in parallel.

In step 636, after caching the block affected by a program command (an “open” block), the method proceeds to execute the program operation on the underlying block. During this process, in traditional SSDs, the block would be marked as “open” and all reads, writes, or other operations would be blocked from accessing the block.

In contrast, as illustrated in FIG. 6D, the method allows for the receipt of conflicting read requests in step 626. In one embodiment, a conflicting read request refers to a read of a page on a block that is opened and being programmed in step 636. Thus, in step 638, the method receives the read command. In some embodiments, the method may confirm that the page to be read is included in the open block. The method then retrieves the open block from cache in step 640. In some embodiments, the method may utilize a page offset to identify a specific memory location corresponding to the page. Finally, in step 642, the method returns the cached data in response to the read.

FIG. 7 is a hardware diagram illustrating a device for accessing an object storage device according to some embodiments of the disclosure.

Client device may include many more or fewer components than those shown in FIG. 7. However, the components shown are sufficient to disclose an illustrative embodiment for implementing the present disclosure.

As shown in FIG. 7, client device includes processing units (CPUs) (702) in communication with a mass memory (704) via a bus (714). Client device also includes one or more network interfaces (716), an audio interface (718), a display (720), a keypad (722), an illuminator (724), an input/output interface (726), and a camera(s) or other optical, thermal or electromagnetic sensors (728). Client device can include one camera/sensor (728), or a plurality of cameras/sensors (728), as understood by those of skill in the art.

Client device may optionally communicate with a base station (not shown), or directly with another computing device. Network interface (716) includes circuitry for coupling client device to one or more networks and is constructed for use with one or more communication protocols and technologies. Network interface (716) is sometimes known as a transceiver, transceiving device, or network interface card (NIC).

Audio interface (718) is arranged to produce and receive audio signals such as the sound of a human voice. For example, audio interface (718) may be coupled to a speaker and microphone (not shown) to enable telecommunication with others and generate an audio acknowledgment for some action. Display (720) may be a liquid crystal display (LCD), gas plasma, light emitting diode (LED), or any other type of display used with a computing device. Display (720) may also include a touch sensitive screen arranged to receive input from an object such as a stylus or a digit from a human hand.

Keypad (722) may comprise any input device arranged to receive input from a user. For example, keypad (722) may include a push button numeric dial, or a keyboard. Keypad (722) may also include command buttons that are associated with selecting and sending images. Illuminator (724) may provide a status indication and provide light. Illuminator (724) may remain active for specific periods of time or in response to events. For example, when illuminator (724) is active, it may backlight the buttons on keypad (722) and stay on while the client device is powered. Also, illuminator (724) may backlight these buttons in various patterns when particular actions are performed, such as dialing another client device. Illuminator (724) may also cause light sources positioned within a transparent or translucent case of the client device to illuminate in response to actions.

Client device also comprises input/output interface (726) for communicating with external devices, such as UPS or switchboard devices, or other input or output devices not shown in FIG. 7. Input/output interface (726) can utilize one or more communication technologies, such as USB, infrared, Bluetooth™, or the like.

Mass memory (704) includes a RAM (706), a ROM (710), and other storage means. Mass memory (704) illustrates another example of computer storage media for storage of information such as computer-readable instructions, data structures, program modules or other data. Mass memory (704) stores a basic input/output system (“BIOS”) (712) for controlling low-level operation of client device. The mass memory may also store an operating system for controlling the operation of client device. It will be appreciated that this component may include a general-purpose operating system such as a version of UNIX, or LINUX™, or a specialized client communication operating system such as Windows Client™, or the Symbian® operating system. The operating system may include, or interface with, a Java virtual machine module that enables control of hardware components and operating system operations via Java application programs.

Memory (704) further includes a PCIe device driver (708). As discussed above, PCIe device driver (708) issues read, write, and other commands to storage devices (not illustrated). In one embodiment, the PCIe device driver (708) issues such commands to SSD (730). In one embodiment, SSD (730) may comprise a solid-state drive or similar device that implements one or more NAND Flash chips such as the chips described in the description of FIGS. 2-3, which are incorporated by reference. In general, SSD (730) comprises any NAND Flash-based device that implements direct memory access such as that illustrated in FIGS. 2-5 and 6A-6D.

For the purposes of this disclosure a module is a software, hardware, or firmware (or combinations thereof) system, process or functionality, or component thereof, that performs or facilitates the processes, features, and/or functions described herein (with or without human interaction or augmentation). A module can include sub-modules. Software components of a module may be stored on a computer readable medium for execution by a processor. Modules may be integral to one or more servers, or be loaded and executed by one or more servers. One or more modules may be grouped into an engine or an application.

Those skilled in the art will recognize that the methods and systems of the present disclosure may be implemented in many manners and as such are not to be limited by the foregoing exemplary embodiments and examples. In other words, functional elements being performed by single or multiple components, in various combinations of hardware and software or firmware, and individual functions, may be distributed among software applications at either the client level or server level or both. In this regard, any number of the features of the different embodiments described herein may be combined into single or multiple embodiments, and alternate embodiments having fewer than, or more than, all of the features described herein are possible.

Functionality may also be, in whole or in part, distributed among multiple components, in manners now known or to become known. Thus, myriad software/hardware/firmware combinations are possible in achieving the functions, features, interfaces and preferences described herein. Moreover, the scope of the present disclosure covers conventionally known manners for carrying out the described features and functions and interfaces, as well as those variations and modifications that may be made to the hardware or software or firmware components described herein as would be understood by those skilled in the art now and hereafter.

Furthermore, the embodiments of methods presented and described as flowcharts in this disclosure are provided by way of example in order to provide a more complete understanding of the technology. The disclosed methods are not limited to the operations and logical flow presented herein. Alternative embodiments are contemplated in which the order of the various operations is altered and in which sub-operations described as being part of a larger operation are performed independently.

While various embodiments have been described for purposes of this disclosure, such embodiments should not be deemed to limit the teaching of this disclosure to those embodiments. Various changes and modifications may be made to the elements and operations described above to obtain a result that remains within the scope of the systems and processes described in this disclosure.

What is claimed is:
1. A method comprising: receiving, by a processor, a data access command, the data access command specifying a host address of a memory to access data; determining, by the processor, that the data access command is a critical access command based on the content of the data access command; issuing, by the processor, the data access command to a controller of a storage device via a first datapath, the first datapath comprising a non-block datapath; and accessing, by the processor, a non-volatile storage component of the storage device through the first datapath and the memory, wherein the non-volatile storage component is mapped to the memory accessible by the processor.
2. The method of claim 1, the mapping a non-volatile storage component to memory of the storage device to memory accessible by the processor comprising mapping a portion of dynamic random-access memory (DRAM) to memory.
3. The method of claim 2, the mapping a portion of DRAM to memory comprising mapping the portion of DRAM to memory using a Peripheral Component Interconnect (PCI) base address register (BAR) of the storage device.
4. The method of claim 3, the mapping a portion of DRAM to memory comprising identifying a size of the portion of DRAM to map, the identifying comprising computing a maximum portion size based on a power loss capacitance supplied to the DRAM.
5. The method of claim 3, the accessing the non-volatile storage component through the first datapath and the memory comprising accessing the non-volatile storage component as a byte-addressable read cache or a byte-accessible write cache.
6. The method of claim 1, the accessing the non-volatile storage component through the first datapath and the memory comprising writing journaling data of an underlying data access command to the non-volatile storage component.
7. The method of claim 1, the mapping a non-volatile storage component to memory of the storage device to memory accessible by the processor comprising mapping a portion of a storage class memory (SCM) to the memory.
8. The method of claim 7, the mapping a portion of an SCM to memory comprising mapping one of a phase change memory (PCM), resistive random-access memory (ReRAM), or spin-transfer torque RAM (STT-RAM) to the memory.
9. The method of claim 7, the mapping a portion of an SCM to memory comprising mapping the SCM to the memory via a controller memory buffer (CMB) window.
10. The method of claim 1, further comprising issuing, by the processor, a block-based data access command via a second datapath after receiving a result of accessing the non-volatile storage component through the first datapath.
11. A system comprising: a storage device, the storage device including a non-volatile storage component; a processor; a memory for tangibly storing thereon program logic for execution by the processor, the stored program logic comprising: logic, executed by the processor, for receiving a data access command, the data access command specifying a host address of a memory to access data; logic, executed by the processor, for determining that the data access command is a critical access command based on the content of the data access command; logic, executed by the processor, for issuing the data access command to a controller of the storage device via a first datapath, the first datapath comprising a non-block datapath; and logic, executed by the processor, for accessing the non-volatile storage component through the first datapath and the memory, wherein the non-volatile storage component of the storage device is mapped to the memory.
12. The system of claim 11, the logic for mapping the non-volatile storage component to the memory comprising logic, executed by the processor, for mapping a portion of dynamic random-access memory (DRAM) to host memory.
13. The system of claim 12, the logic for mapping a portion of DRAM to host memory comprising logic, executed by the processor, for mapping the portion of DRAM to memory using a Peripheral Component Interconnect (PCI) base address register (BAR) of the storage device.
14. The system of claim 13, the logic for mapping a portion of DRAM to host memory comprising identifying a size of the portion of DRAM to map, the identifying comprising computing a maximum portion size based on a power loss capacitance supplied to the DRAM.
15. The system of claim 13, the logic for accessing the non-volatile storage through the first datapath and the host memory comprising accessing the non-volatile storage component as a byte-addressable read cache or a byte-accessible write cache.
16. The system of claim 11, the logic for accessing the non-volatile storage component through the first datapath and the host memory comprising logic, executed by the processor, for writing journaling data of an underlying data access command to the non-volatile storage component.
17. The system of claim 11, the logic for mapping a non-volatile storage component to the memory comprising logic, executed by the processor, for mapping a portion of a storage class memory (SCM) to the host memory.
18. The system of claim 17, the logic mapping a portion of an SCM to host memory comprising logic, executed by the processor, for mapping one of a phase change memory (PCM), resistive random-access memory (ReRAM), or spin-transfer torque RAM (STT-RAM) to the host memory.
19. The system of claim 17, the logic for mapping a portion of an SCM to host memory comprising logic, executed by the processor, for mapping the SCM to the host memory via a controller memory buffer (CMB) window.
20. The system of claim 11, the program logic further comprising logic, executed by the processor, for issuing a block-based data access command via a second datapath after receiving a result of accessing the non-volatile storage component through the first datapath.